Fix shared memory layout mismatch in 2-bit residual kernel #6
Merged
DD-DuDa merged 1 commit into OpenBitSys:e2e on May 9, 2026
Summary
This PR fixes an invalid shared memory access in the 2-bit residual decode path.
The residual kernel was using Kernel_traits::SharedStorage to interpret dynamic shared memory, while the launch side allocates shared memory according to Kernel_traits::SharedStorage_residual.
This mismatch can cause the kernel to compute shared-memory field offsets using the packed/split-kernel layout, even though the actual allocated shared-memory region follows the residual layout. In the 2-bit residual path, this leads to an invalid __shared__ write detected by compute-sanitizer.
Root Cause
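Before this patch, the residual kernel interpreted its dynamic shared memory through Kernel_traits::SharedStorage.
However, the residual kernel should use the residual shared-memory layout, Kernel_traits::SharedStorage_residual, which is the layout the launch side uses to size the allocation.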
The bug may not always be visible in the 4-bit path because the accessed shared-memory range can happen to stay within the allocated region. In the 2-bit residual path, the residual block/layout pressure is larger and the mismatch triggers an out-of-bounds shared-memory write.
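The effect described above can be modeled on the host in plain C++. This is a minimal sketch under stated assumptions: the two struct templates, their field names, and all byte counts are hypothetical stand-ins, not the real Kernel_traits::SharedStorage or Kernel_traits::SharedStorage_residual definitions. It checks whether a field addressed through one layout still lies inside a buffer sized for the other layout, and shows how the same mismatch can be harmless in one configuration and out of bounds in another.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins: the real SharedStorage / SharedStorage_residual
// members and sizes are not shown in this PR description.
template <std::size_t kOutBytes>
struct AliasedLayoutSketch {        // layout the kernel interprets memory with
    unsigned char q[256];
    unsigned char kv[512];
    unsigned char out[kOutBytes];   // field the kernel writes
};

template <std::size_t kResBytes>
struct AllocatedLayoutSketch {      // layout the launch sizes memory with
    unsigned char q[256];
    unsigned char res[kResBytes];
};

// True if bytes [offset, offset + size) lie inside an allocation of
// `allocated` bytes.
constexpr bool fits(std::size_t offset, std::size_t size, std::size_t allocated) {
    return offset + size <= allocated;
}

// Configuration A: the layouts disagree, but the write stays in bounds.
using AliasedA = AliasedLayoutSketch<128>;
using AllocatedA = AllocatedLayoutSketch<1024>;
constexpr bool kConfigAFits =
    fits(offsetof(AliasedA, out), sizeof(AliasedA::out), sizeof(AllocatedA));

// Configuration B: the same kind of mismatch, but the write runs past the
// end of the allocation.
using AliasedB = AliasedLayoutSketch<512>;
using AllocatedB = AllocatedLayoutSketch<768>;
constexpr bool kConfigBFits =
    fits(offsetof(AliasedB, out), sizeof(AliasedB::out), sizeof(AllocatedB));

static_assert(kConfigAFits, "config A stays inside the allocation by luck");
static_assert(!kConfigBFits, "config B writes out of bounds");
```

In this model, configuration A plays the role of a path where the bug stays hidden and configuration B the role of a path where a memory checker would report it. Which real configuration corresponds to which case depends on the actual layouts, which this sketch does not reproduce.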
Fix
Change the residual kernel's shared-memory alias so that it refers to Kernel_traits::SharedStorage_residual instead of Kernel_traits::SharedStorage.
This makes the shared-memory layout used by the kernel consistent with the shared-memory size allocated at launch time.
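One way to keep the two sides from drifting apart again is a compile-time check that the kernel-side layout and the launch-side size come from the same type. The sketch below is not part of this patch and makes several assumptions: KernelTraitsSketch, ResidualKernelSketch, and residual_launch_smem_bytes are hypothetical names, and the struct fields and sizes are invented. Only the member names SharedStorage and SharedStorage_residual come from the PR text.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical traits: only the two member names come from the PR text;
// the fields and sizes are invented for illustration.
struct KernelTraitsSketch {
    struct SharedStorage {
        unsigned char q[256];
        unsigned char kv[512];
        unsigned char out[512];
    };
    struct SharedStorage_residual {
        unsigned char q[256];
        unsigned char res[768];
    };
};

// Launch side: bytes of dynamic shared memory requested for the residual kernel.
template <class Traits>
constexpr std::size_t residual_launch_smem_bytes() {
    return sizeof(typename Traits::SharedStorage_residual);
}

// Kernel side: the layout the residual kernel uses to interpret that memory.
template <class Traits>
struct ResidualKernelSketch {
    using SharedStorage = typename Traits::SharedStorage_residual;
    static constexpr std::size_t smem_bytes() { return sizeof(SharedStorage); }
};

// Compile-time guards: these fail to compile if the kernel-side alias is
// pointed back at Traits::SharedStorage, because both the type and the
// size (1280 vs 1024 bytes in this sketch) would differ.
static_assert(std::is_same<ResidualKernelSketch<KernelTraitsSketch>::SharedStorage,
                           KernelTraitsSketch::SharedStorage_residual>::value,
              "kernel and launch must agree on the residual layout");
static_assert(ResidualKernelSketch<KernelTraitsSketch>::smem_bytes() ==
                  residual_launch_smem_bytes<KernelTraitsSketch>(),
              "kernel view must match the allocated size");
```

The type check is the stronger of the two guards, since two different layouts can share a size. Whether such a guard fits the project's actual kernel and launch code is a design choice this sketch does not settle.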
Reproduction
After building and installing the package, the issue can be reproduced with:
python evaluation/example.py \
    --model_path xxx \
    --max_length 8192 \
    --num_bits 2 \
    --quant_mode k-channel \
    --attn_backend bit_decoding